    A Multifaceted Benchmarking of Synthetic Electronic Health Record Generation Models

    Synthetic health data have the potential to mitigate privacy concerns when sharing data to support biomedical research and the development of innovative healthcare applications. Modern approaches for data generation based on machine learning, generative adversarial network (GAN) methods in particular, continue to evolve and demonstrate remarkable potential. Yet there is a lack of a systematic assessment framework to benchmark methods as they emerge and determine which methods are most appropriate for which use cases. In this work, we introduce a generalizable benchmarking framework to appraise key characteristics of synthetic health data with respect to utility and privacy metrics. We apply the framework to evaluate synthetic data generation methods for electronic health record (EHR) data from two large academic medical centers with respect to several use cases. The results illustrate that there is a utility-privacy tradeoff for sharing synthetic EHR data. The results further indicate that no method is unequivocally the best on all criteria in each use case, which makes it evident why synthetic data generation methods need to be assessed in context.

    PCAdmix: Principal Components-Based Assignment of Ancestry along Each Chromosome in Individuals with Admixed Ancestry from Two or More Populations

    Identifying ancestry along each chromosome in admixed individuals provides a wealth of information for understanding the population genetic history of admixture events and is valuable for admixture mapping and identifying recent targets of selection. We present PCAdmix (available at https://sites.google.com/site/pcadmix/home), a Principal Components-based algorithm for determining ancestry along each chromosome from a high-density, genome-wide set of phased single-nucleotide polymorphism (SNP) genotypes of admixed individuals. We compare our method to HAPMIX on simulated data from two ancestral populations, and we find high concordance between the methods. Our method also has better accuracy than LAMP when applied to three-population admixture, a situation as yet unaddressed by HAPMIX. Finally, we apply our method to a data set of four Latino populations with European, African, and Native American ancestry. We find evidence of assortative mating in each of the four populations, and we identify regions of shared ancestry that may be recent targets of selection and could serve as candidate regions for admixture-based association mapping.

    Identifying and ranking potential driver genes of Alzheimer's disease using multiview evidence aggregation.

    MOTIVATION: Late onset Alzheimer's disease is currently a disease with no known effective treatment options. To better understand the disease, new multi-omic datasets have recently been generated with the goal of identifying molecular causes of disease. However, most analytic studies using these datasets focus on uni-modal analysis of the data. Here, we propose a data-driven approach to integrate multiple data types and analytic outcomes to aggregate evidence supporting the hypothesis that a gene is a genetic driver of the disease. The main algorithmic contributions of our article are: (i) a general machine learning framework to learn the key characteristics of a few known driver genes from multiple feature sets and identify other potential driver genes which have similar feature representations, and (ii) a flexible ranking scheme with the ability to integrate external validation in the form of Genome Wide Association Study summary statistics. While we currently focus on demonstrating the effectiveness of the approach using different analytic outcomes from RNA-Seq studies, this method is easily generalizable to other data modalities and analysis types. RESULTS: We demonstrate the utility of our machine learning algorithm on two benchmark multiview datasets by significantly outperforming the baseline approaches in predicting missing labels. We then use the algorithm to predict and rank potential drivers of Alzheimer's. We show that our ranked genes show a significant enrichment for single nucleotide polymorphisms associated with Alzheimer's and are enriched in pathways that have been previously associated with the disease. AVAILABILITY AND IMPLEMENTATION: Source code and links to all feature sets are available at https://github.com/Sage-Bionetworks/EvidenceAggregatedDriverRanking

    Global effects of DNA replication and DNA replication origin activity on eukaryotic gene expression

    This report provides a global view of how gene expression is affected by DNA replication. We analyzed synchronized cultures of Saccharomyces cerevisiae under conditions that prevent DNA replication initiation without delaying cell cycle progression. We use a higher-order singular value decomposition to integrate the global mRNA expression measured in the multiple time courses, detect and remove experimental artifacts, and identify significant combinations of patterns of expression variation across the genes, time points and conditions. We find that, first, ∼88% of the global mRNA expression is independent of DNA replication. Second, the requirement of DNA replication for efficient histone gene expression is independent of conditions that elicit DNA damage checkpoint responses. Third, origin licensing decreases the expression of genes with origins near their 3′ ends, revealing that downstream origins can regulate the expression of upstream genes. This confirms previous predictions from mathematical modeling of a global causal coordination between DNA replication origin activity and mRNA expression, and shows that mathematical modeling of DNA microarray data can be used to correctly predict previously unknown biological modes of regulation.
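
For readers unfamiliar with the higher-order singular value decomposition mentioned in the abstract above, the following is a minimal sketch of the general technique on a three-way array, assuming NumPy is available. It is not the authors' pipeline: the function names (`unfold`, `hosvd`, `reconstruct`), the tensor shape, and the random placeholder data are all hypothetical and chosen only for illustration.

```python
import numpy as np

def unfold(tensor, mode):
    """Matricize a tensor: bring axis `mode` to the front, flatten the rest."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def mode_product(tensor, matrix, mode):
    """Multiply `matrix` into axis `mode` of `tensor` (mode-n product)."""
    out = np.tensordot(matrix, tensor, axes=(1, mode))
    return np.moveaxis(out, 0, mode)

def hosvd(tensor):
    """Return (core, factors) such that the tensor equals the core
    multiplied along every mode by the corresponding factor matrix."""
    factors = []
    for mode in range(tensor.ndim):
        # Left singular vectors of each unfolding give that mode's patterns.
        u, _, _ = np.linalg.svd(unfold(tensor, mode), full_matrices=False)
        factors.append(u)
    core = tensor
    for mode, u in enumerate(factors):
        core = mode_product(core, u.T, mode)
    return core, factors

def reconstruct(core, factors):
    """Invert the decomposition by multiplying each factor back in."""
    out = core
    for mode, u in enumerate(factors):
        out = mode_product(out, u, mode)
    return out

# Placeholder data only: a hypothetical genes x time points x conditions array.
rng = np.random.default_rng(0)
X = rng.standard_normal((6, 4, 3))
core, factors = hosvd(X)
X_rebuilt = reconstruct(core, factors)
```

Each factor matrix holds orthonormal patterns along one axis (here, hypothetically, genes, time points, or conditions), and the core tensor entries weight the combinations of those patterns. Zeroing selected core entries before calling `reconstruct` is one generic way to drop a combination of patterns, though whether this matches the artifact-removal step in the paper is an assumption.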